[WIP] [Performance Improvement] Fine-granularity locking in stream_ordered_memory_resource #1912
base: branch-25.06
Conversation
To the reviewers: Thank you for your feedback! As an external contributor, I may have overlooked some details, and this is still a draft. That said, I've already distinguished between hot and cold paths in the design, implemented thread safety for this part, and passed unit tests locally. The remaining work on my side covers small details and TODOs, including: 1) Protecting …

A detailed description
Although this PR is still WIP, I've built it with PTDS and run local testing. For unit testing, I increased the thread count from 4 to 16 in the …. I welcome feedback on the current implementation. Most of my effort has gone into understanding the existing codebase, profiling bottlenecks, and debugging the solution. As I'm still learning the RMM codebase, thorough code reviews would be particularly valuable. I can also convert this back to a regular PR for running CI tests 😄

Comments on a lock-free map

Regarding a lock-free map (as discussed with @vyasr in #1887), I think such an implementation is unnecessary for several reasons:
Force-pushed from d056d6f to e0633c4 (commit: "…t has its own mutex", Signed-off-by: Jigao Luo <jigao.luo@outlook.com>)
Performance (in libcudf) with this PR

I benchmarked this PR in libcudf by reading Parquet files with 32 concurrent streams, mirroring the setup in issue #1887 (comment). While the RMM memory allocator in the "librmm" row now shows negligible overhead (single-digit microseconds per call, a huge improvement 🚀) and the mutex contention from RMM in the "OS runtime lib" row has mostly been eliminated, the end-to-end speedup is modest due to unexpectedly increased CUDA synchronization costs in the "CUDA API" row.

But why more cudaSync time? 🤔

The profiling figure above reveals CUDA synchronization overhead in the "CUDA API" row, which directly constrains the end-to-end speedup for libcudf Parquet reading. The root cause, to my understanding (as I also hinted in #1887 (comment)), is the following:
Short conclusion

While this PR resolves a critical bottleneck in the RMM mutex, the end-to-end speedup in libcudf Parquet reading is likely capped by deeper dependencies between asynchronous kernels and synchronous memory allocation. I now realize that a substantial speedup in libcudf will be more challenging than I initially anticipated, since I misidentified the critical bottleneck: the primary constraint lies not in the RMM mutex but in the kernels of libcudf.
```cpp
log_summary_trace();

read_lock_guard rlock(stream_free_blocks_mtx_);
// Try to find a satisfactory block in free list for the same stream (no sync required)
auto iter = stream_free_blocks_.find(stream_event);
if (iter != stream_free_blocks_.end()) {
  // Hot path
  lock_guard free_list_lock(iter->second.get_mutex());
  iter->second.insert(block);
} else {
  rlock.unlock();
  // Cold path
  write_lock_guard wlock(stream_free_blocks_mtx_);
  stream_free_blocks_[stream_event].insert(block);  // TODO(jigao): is it thread-safe?
}
```
There is a potential race here, I think.

Suppose that two threads are using the same stream: they both enter line 259 to search for the `stream_event` in `stream_free_blocks_`. That key doesn't exist, so they go down the cold path on line 266. Both unlock the mutex and go ahead and try to grab the write lock. Only one of the threads can win; call this thread-A. Meanwhile thread-B waits to acquire the lock. Now thread-A inserts its `free_block` and continues, unlocking the mutex. Now thread-B comes in to insert its block, but the key already exists, and so `stream_free_blocks_[stream_event].insert(block)` does nothing. So we drop the reference to `block` from thread-B without returning it to the pool, breaking the contract that `free_block` has:
```cpp
/**
 * @brief Finds, frees and returns the block associated with pointer `ptr`.
 *
 * @param ptr The pointer to the memory to free.
 * @param size The size of the memory to free. Must be equal to the original allocation size.
 * @return The (now freed) block associated with `p`. The caller is expected to return the block
 *         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 * to the pool.
 */
block_type free_block(void* ptr, std::size_t size) noexcept
```
@wence- Thanks for reviewing.

> , and so stream_free_blocks_[stream_event].insert(block) does nothing.

Could you explain why thread B does nothing in this case? I had the same case in mind and the same concern. But what I expect to happen is:

- Thread A and thread B must have different `block`s from different pointers.
- After thread B acquires the write lock, it must see thread A's memory writes to `stream_free_blocks_`: the map has a new entry inserted by thread A.
- This implies that `stream_free_blocks_[stream_event]` in thread B returns a valid reference to the created `free_list`.
- Finally, thread B inserts its `block` into this list.

The conclusion I have is: once a write lock is held on the map, operating on the map and its contained free_lists is thread-safe. Is there a flaw in this?
If the following code is preferable to line 270, I'd be happy to replace it:

```cpp
auto iter = stream_free_blocks_.find(stream_event);
free_list& blocks =
  (iter != stream_free_blocks_.end()) ? iter->second : stream_free_blocks_[stream_event];
lock_guard free_list_lock(blocks.get_mutex());
blocks.insert(block);
```
Ah sorry, I thought this `insert` was calling the `insert` method on the `std::map` that is `stream_free_blocks_`. However, you're right: it's calling `insert` on the `free_list` that you get back from the map lookup. So I think this implementation is thread-safe.
@wence- Thanks for the confirmation! I will still replace line 270 as discussed, for better code style. The write locks on `free_list` are acquired in a consistent order, so there should be no deadlock.

I'll replace line 270 with this code section and add clarifying comments. I've already run unit tests with 64 threads on my end, and the results are stable 🚀
@wence- I've refactored `do_deallocate` for better readability using early returns, while keeping the same logic as we discussed. I also added more comments explaining the hot/cold-path ideas. My commit passes local unit tests with 64 threads.
Signed-off-by: Jigao Luo <jigao.luo@outlook.com>
```diff
@@ -253,9 +256,60 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
   // streams allows stealing from deleted streams.
   RMM_ASSERT_CUDA_SUCCESS(cudaEventRecord(stream_event.event, stream.value()));

-  stream_free_blocks_[stream_event].insert(block);
+  read_lock_guard rlock(stream_free_blocks_mtx_);
```
This adds a lot of complexity to `do_deallocate` that I think should be in the free list implementation. I originally designed this to be portable to other free-list implementations, which is why this function was originally so simple -- it more or less just called `insert` on the stream's free list.

This allocator is already quite fast, and I think in your exploration you ultimately found that it's not the actual bottleneck? Is it worth adding so much complexity? If the complexity can be put in the free list, it might be better.
@JigaoLuo apologies for the delay. I haven't quite found time to look at this PR yet, but I promise that I will get into it soon!
I've conducted performance benchmarks using the following range configuration:

```cpp
static void benchmark_range(benchmark::internal::Benchmark* bench)
{
  bench  //
    ->RangeMultiplier(2)
    ->Ranges({{1, 32}, {4, 32}, {false, true}})  // num_streams, num_kernels, do_prewarm
    ->Unit(benchmark::kMicrosecond);
}
```

The benchmark shows significant speedup (up to 15x) in high-concurrency scenarios:
(While these RMM changes may not directly or easily boost libcudf's end-to-end performance, they isolate synchronization gains and hint at broader optimization potential. This side study in RMM is part of my story issue in libcudf, rapidsai/cudf#18892, and major speedups in libcudf may require deeper integration.)
What is …?
Hi @harrism,

> rmm/cpp/benchmarks/multi_stream_allocations/multi_stream_allocations_bench.cu, Lines 54 to 61 in a828642

Before the benchmark timer, the … With …
Description

This PR optimizes `stream_ordered_memory_resource` with fine-grained locking: it replaces the monolithic mutex with dedicated locks at fine granularity and differentiates hot/cold paths to reduce synchronization overhead. This fixes severe CPU contention that delayed GPU kernels waiting for memory.

Tries to close #1887, which records my performance observation.
Checklist